Abstract

Recently several authors have proposed stochastic evolutionary models for the growth 
of the web graph and other networks that give rise to power-law distributions. These 
models are based on the notion of preferential attachment leading to the "rich get richer" 
phenomenon. We present a generalisation of the basic model by allowing deletion of 
individual links and show that it also gives rise to a power-law distribution. We derive 
the mean-field equations for this stochastic model and show that, by examining a snapshot
of the distribution at the steady state of the model, we are able to tell whether any link 
deletion has taken place and estimate the probability of deleting a link. Applying our 
model to actual web graph data gives some insight into the distribution of inlinks in 
the web graph and provides evidence of link deletion and the extent to which this has 
occurred. Our analysis of the data also suggests a power-law exponent of approximately 
2.15 rather than the widely published value of 2.1. 

1 Introduction 

Power-law distributions taking the form 

$$ f(i) = C\, i^{-\tau}, \tag{1} $$

where C and $\tau$ are positive constants, are abundant in nature [Sch91]. The constant $\tau$ is
called the exponent of the distribution. Examples of such distributions are: Zipf's law, which 
states that the relative frequency of words in a text is inversely proportional to their rank, 
Pareto's law, which states that the number of people whose personal income is above a certain
level follows a power-law distribution with an exponent between 1.5 and 2, Lotka's law, which 
states that the number of authors publishing a prescribed number of papers is inversely 
proportional to the square of the number of publications, and Gutenberg-Richter's law, which 
states that the number of earthquakes over a period of time having a certain magnitude is 
roughly inversely proportional to the magnitude. 

Recently several researchers have detected asymptotic power-law distributions in the
topology of the World-Wide-Web [BKM+00, DKM+02] and, in parallel, researchers from
a variety of fields are trying to explain the emergence of these power laws in terms of the
evolution of complex networks. See [AB02, DM02, New03] for comprehensive surveys of recent
work in this area, detailing various stochastic models that can explain the evolution of
the web and other networks such as the internet, citation networks, collaboration networks
and biological networks. A common theme in these models is that of preferential attachment,
which results in the "rich get richer" phenomenon, for example, when new links to web pages
are added in proportion to the number of currently existing links to these pages. Related
approaches are the general theoretical model covering both directed and undirected web graphs
[CF03], the stochastic multiplicative process in which nodes appear at different times and the
rate of addition of links varies between nodes [AH01], and the statistical physics approach
that uses the rate equation technique [KR02].

To explain situations where pure preferential attachment models fail, we [LFLW02] and
others [PFL+02] have previously proposed extensions of the stochastic model for the web's
evolution in which the addition of links is prescribed by a mixture of preferential and
non-preferential mechanisms. In [LFLW02], we devised a general stochastic model involving the
transfer of balls between urns; this also naturally models quantities such as the numbers of
web pages in and visitors to a web site, which are not naturally described in graph-theoretic
terms. We note that our urn model is an extension of the stochastic model proposed by
Simon in his visionary paper published in 1955 [Sim55], which was couched in terms of word
frequencies in a text. We also considered an alternative extension of Simon's model in [FLL02]
by adding a preferential mechanism for discarding balls from urns (corresponding to deleting 
web pages); this results in an exponential cutoff in the power-law distribution. 

Our urn transfer model is a stochastic process, in which at each step with probability p a 
new ball (which might represent a web page) is added to the first urn and with probability 
1 - p a ball in some urn is moved along to the next urn. We assume that a ball in the ith urn
has i pins attached to it (which might represent web links). It is known that the steady-state
distribution of this model is a power law, with exponent $\tau = 1 + 1/(1 - p)$ [Sim55, LFLW02].
As mentioned above, the power-law distribution breaks down when balls may be discarded, 
resulting in a power-law distribution with an exponential cutoff. So the question arises as 
to whether it also breaks down when removal of pins is allowed. We answer this question 
here by showing that the power-law distribution does not break down, under the constraint 
that more balls are added to the system than are removed. (When the only remaining pin is 
removed from a ball in the first urn that ball is removed from the system.) This model gives 
a more realistic explanation for the emergence of power laws in complex networks than the 
basic model without link deletion (i.e. pin removal). Considering, for example, the web graph, 
the modification we make to the basic model is that after a web page is chosen preferentially, 
say according to the number of its inlinks, there is a small probability that some link to this 
page will be deleted. (This is equivalent to deleting a link chosen uniformly at random.) 
A possible reason for this may be that popular web pages compete for inlinks, so that the 
number of inlinks a page has acquired will fluctuate with its popularity. This is evident, for 
example, in the rise and fall in the popularity of several search engines. Link deletion has also 
been considered by Dorogovtsev and Mendes [DM01] in the context of a different model in
which, at each time step, a new node is added with just one link, which preferentially attaches 
itself to an existing node. In addition, at each time step m links between existing nodes are 
added to preferentially chosen nodes and c links between existing nodes are deleted, again 
preferentially. Their conclusion that link deletion increases the power-law exponent is also 
obtained here for our stochastic model. 

Consider a power law such as Lotka's law [Nic89]. If this holds, then a plot on a log-log
scale of the number of authors (the frequency) against the number of publications (the value)
should reveal a straight line with a negative slope of around -2. There is an obvious problem
with a log-log transformation if any of the frequencies are zero; that is to say, when there
are values $v_1$ and $v_2$, with $v_1 < v_2$, for which no author published $v_1$ papers but at least one
author published $v_2$ papers (i.e. there is a gap in the values for the number of publications).

In general, we expect such gaps to occur in any data set, mainly for large values. (This 
is due to the stochastic variation and the fact that the frequencies have to be integral.) This 
observation is consistent with the finding that Lotka's law does not fit well for large values 
(i.e. authors who published a large number of papers) and this tail region is characterised by 
the presence of gaps [Nic89].

One way of dealing with the problem of gaps is simply to ignore all value-frequency pairs 
where the frequency is zero. However, ignoring gaps in this way seems questionable, since 
the zero frequency values do give relevant information about the data set and should not 
be treated as missing values. Our unease with this approach is reinforced by the fact that 
computing the exponent in this way results in a much lower value; for example, for the web 
inlink data [BKM+00], it gives a value around 1.5, rather than the generally accepted figure
of 2.1.

An alternative approach is to squash the non-zero frequencies up towards the y-axis (i.e. 
the frequency axis), after ignoring the zero frequency values. This is equivalent to omitting 
from the data set those values for which there are no authors having that number of
publications and then ranking the remaining values in increasing order, i.e. renumbering the
values starting from one. This method also seems somewhat dubious, since the power-law 
relationship should really involve the actual values rather than their ranks. 

The standard technique for fitting a power-law distribution uses linear regression on log- 
log transformed data, which is not possible if gaps are present. Another approach for handling 
gaps is to consider only the values up until the first gap; this approach is only reasonable 
if the first gap does not occur at too low a value. We call this the unranked approach, and 
note that this approach ignores the large values in the tail of the distribution. The unranked 
approach suggests one possible solution, but other approaches, such as smoothing or using 
the Hill plot [DdR00], are possible. Ad-hoc approaches have the disadvantage that they are
hard to justify, and the Hill plot, as originally defined, is only applicable after the data has 
been sorted, and this also seems difficult to justify. 

Another solution for handling gaps is the second method described above of ranking the 
values with non-zero frequencies and squashing up; we call this the ranked approach. In 
general, we should not expect the ranked and unranked approaches to yield the same power- 
law exponent. We would expect the exponent computed by linear regression on the log-log 
transformed data from the unranked approach to be somewhat higher. One of the findings of 
this paper is that for inlinks in the web graph the ranked and unranked approaches lead to 
a small but noticeable difference between the exponents. (Our previous comments indicate 
why we have misgivings with the ranked approach.) 
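
The two regression variants can be sketched as follows. This is a minimal illustration in Python, not the procedure used to produce the results in this paper; the function names and the dictionary `freq` (mapping each value to its frequency) are assumptions made for the example, and `unranked_exponent` assumes that the value 1 and at least one further value have non-zero frequency.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) regressed on log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx

def unranked_exponent(freq):
    """Use the actual values 1, 2, ... and stop just before the first gap."""
    xs, ys = [], []
    v = 1
    while freq.get(v, 0) > 0:
        xs.append(v)
        ys.append(freq[v])
        v += 1
    return -loglog_slope(xs, ys)

def ranked_exponent(freq):
    """Drop zero frequencies and renumber the remaining values 1, 2, ..."""
    values = sorted(v for v, f in freq.items() if f > 0)
    ranks = range(1, len(values) + 1)
    return -loglog_slope(ranks, [freq[v] for v in values])
```

On data with no gaps the two functions return the same exponent; once a gap is present they generally differ, because the ranked variant regresses on ranks rather than on the actual values.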

The rest of the paper is organised as follows. In Section 2 we present an urn transfer
model that extends Simon's model by allowing a pin, chosen uniformly at random, to be
discarded with some specified probability. In Section 3 we derive the steady state distribution
of the model, which, as stated, follows an asymptotic power law like (1). In Section 4 we
show that, by examining a snapshot of the distribution of balls in urns at the steady state,
we are able to tell whether removal of pins has taken place and estimate the pin removal
probability. In Section 5 we utilise our urn model to describe a discrete stochastic process
that simulates the evolution of the degree distribution of inlinks in the web graph. As a proof
of concept, we analyse the May 1999 data set for inlinks presented in [BKM+00], which we
obtained from Ravi Kumar at IBM. We are able to show that our model is consistent with
the data and determine the extent to which link deletion has occurred. We also investigate
the discrepancy between the ranked and unranked approaches. With the ranked approach,
using linear regression on the log-log transformed data, an exponent of 2.1 is obtained, as in
[BKM+00]. However, the unranked approach results in an exponent of 2.15, which is shown
to be consistent with our stochastic model. Although the difference between 2.1 and 2.15 may
not seem significant, it has been remarked in [BKM+00] that "2.1 is in remarkable agreement
with earlier studies" and in [DKM+02] that the exponent is "reliably around 2.1 (with little
variation)", which justifies us in making a distinction between the exponents obtained using
the ranked and unranked approaches. Finally, in Section 6 we give our concluding remarks.



2 An Urn Transfer Model 

We now present an urn transfer model for a stochastic process with urns containing balls 
(which might represent web pages) that have pins (which might represent either inlinks or 
outlinks) attached to them. Our model allows for pins to be discarded with a small probability. 
This model can be viewed as an extension of Simon's model [Sim55]. We note that there is
a correspondence between the Barabási and Albert model [BA99], defined in terms of nodes
and links, and Simon's model, defined in terms of balls and pins, as was established in [BE01].
Essentially, the correspondence is obtained by noting that the balls in an urn can be viewed 
as an equivalence class of nodes all having the same indegree (or outdegree). 

We assume a countable number of urns, $urn_1, urn_2, urn_3, \ldots$, where each ball in $urn_i$ is
assumed to have i pins attached to it. Initially all the urns are empty except $urn_1$, which has
one ball in it. Let $F_i(k)$ be the number of balls in $urn_i$ at stage k of the stochastic process,
so $F_1(1) = 1$. Then, for $k \ge 1$, at stage k + 1 of the stochastic process one of two things may
occur:

(i) with probability p, 0 < p < 1, a new ball is inserted into $urn_1$, or

(ii) with probability 1 - p an urn is selected, with $urn_i$ being selected with probability
proportional to $i F_i(k)$, and a ball is chosen from $urn_i$; then,

(a) with probability r, $0 < r \le 1$, the chosen ball is transferred to $urn_{i+1}$ (this is
equivalent to attaching an additional pin to the ball chosen from $urn_i$), or

(b) with probability 1 - r, the ball is transferred to $urn_{i-1}$ if i > 1, otherwise, if i = 1,
the ball is discarded (this is equivalent to removing and discarding a pin from the
ball chosen from $urn_i$).

In the special case when r = 1, the process reduces to Simon's original model. 
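
The process defined by steps (i) and (ii) can be simulated directly. The sketch below is one straightforward way to do so in Python for small runs; the function name, the dictionary representation of the urns and the seeded random generator are choices made for this example rather than part of the model, and the simulation stops early if all the urns become empty.

```python
import random

def simulate_urns(p, r, steps, seed=0):
    """Simulate the urn transfer process for a number of steps.

    Returns (F, pins, balls), where F[i] is the number of balls in urn_i.
    """
    rng = random.Random(seed)
    F = {1: 1}            # stage 1: a single ball in urn_1
    pins, balls = 1, 1
    for _ in range(steps):
        if pins == 0:     # all urns are empty, so the process has terminated
            break
        if rng.random() < p:
            # (i) insert a new ball into urn_1
            F[1] = F.get(1, 0) + 1
            pins += 1
            balls += 1
            continue
        # (ii) pick urn_i with probability proportional to i * F_i(k),
        # by drawing one of the pins uniformly at random
        target = rng.randrange(pins)
        acc = 0
        for i, count in F.items():
            acc += i * count
            if target < acc:
                break
        F[i] -= 1
        if F[i] == 0:
            del F[i]
        if rng.random() < r:
            # (a) attach a pin: move the ball to urn_{i+1}
            F[i + 1] = F.get(i + 1, 0) + 1
            pins += 1
        else:
            # (b) remove a pin: move the ball to urn_{i-1}, or discard it
            pins -= 1
            if i > 1:
                F[i - 1] = F.get(i - 1, 0) + 1
            else:
                balls -= 1
    return F, pins, balls
```

With r = 1 every step adds exactly one pin, so after `steps` steps the process holds `steps + 1` pins, as in Simon's original model.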

We note that choosing a ball preferentially (in proportion to the number of pins) is equivalent
to selecting a pin uniformly at random and choosing the ball it is attached to. Thus,
with probability (1 - p)(1 - r), a pin chosen uniformly at random is discarded.

Since $i F_i(k)$ is the total number of pins attached to balls in $urn_i$, the expected total
number of pins in the urns at stage k is given by



$$ E\Big(\sum_{i=1}^{k} i F_i(k)\Big) = 1 + (k-1)\big(p + (1-p)r - (1-p)(1-r)\big) = 1 + (k-1)\big(1 - 2(1-p)(1-r)\big). \tag{2} $$



Correspondingly, the expected total number of balls in the urns is given by 

$$ E\Big(\sum_{i=1}^{k} F_i(k)\Big) = 1 + (k-1)p - (1-p)(1-r) \sum_{j=1}^{k-1} \phi_j, \tag{3} $$

where $\phi_j$, $1 \le j \le k-1$, is the expected probability of choosing $urn_1$ at step (ii) of stage
j + 1, i.e.

$$ \phi_j = E\left( \frac{F_1(j)}{\sum_{i=1}^{j} i F_i(j)} \right). \tag{4} $$

Now let

$$ \phi^{(k)} = \frac{1}{k} \sum_{j=1}^{k} \phi_j. \tag{5} $$

In order to ensure that there are at least as many pins in the system as there are balls
and that, on average, more balls are added to the system than are removed, we require the
following constraint, derived from (2) and (3):

$$ 2 - \frac{1}{1-r} \le \phi^{(k)} < \frac{p}{(1-p)(1-r)}. \tag{6} $$

This implies

$$ 2(1-p)(1-r) < 1, \tag{7} $$

which obviously holds for $r \ge 1/2$. Inequality (7) expresses the fact that the expected number
of pins should not be negative, and follows from (2).

We note that we could modify the initial conditions so that, for example, $urn_1$ initially
contained $\delta$ balls, $\delta > 1$, instead of just one ball. It can be shown, from the development of the
model below, that any change in the initial conditions will have no effect on the asymptotic
distribution of the balls in the urns as k tends to infinity, provided the process does not
terminate with all of the urns empty. More specifically, the probability that the process will
not terminate with all the urns empty is given by

$$ 1 - \left( \frac{(1-p)(1-r)}{1 - (1-p)(1-r)} \right)^{\delta}. \tag{8} $$

This is exactly the probability that the gambler's fortune will increase forever [Ros83].
(We note that (8) only makes sense if (7) holds.) Since, in practice, r will be quite close to
one, once the process has survived for a few steps the probability that it will subsequently 
terminate is small. From now on we will therefore assume that the process does not terminate. 
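
As a rough numerical illustration of the survival probability, the function below evaluates the classical gambler's ruin expression, with a pin being removed with probability (1 - p)(1 - r) at each step and added otherwise. The function and argument names are assumptions made for this sketch; `delta` is the initial number of balls in $urn_1$.

```python
def survival_probability(p, r, delta=1):
    """Probability that the urns never all become empty (gambler's ruin)."""
    q = (1 - p) * (1 - r)      # probability that a step removes a pin
    if 2 * q >= 1:
        raise ValueError("requires 2(1 - p)(1 - r) < 1")
    return 1 - (q / (1 - q)) ** delta
```

Without pin removal (r = 1) the function returns 1, since the process can then never lose a pin.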

3 Derivation of the Steady State Distribution

Following Simon [Sim55], we now state the mean-field equations for the urn transfer model.
For i > 1 the expected number of balls in $urn_i$ is given by

$$ E_k(F_i(k+1)) = F_i(k) + \beta_k \big( r(i-1)F_{i-1}(k) + (1-r)(i+1)F_{i+1}(k) - i F_i(k) \big), \tag{9} $$

where $E_k(F_i(k+1))$ is the expected value of $F_i(k+1)$, given the state of the model at stage
k, and

$$ \beta_k = \frac{1-p}{\sum_{i=1}^{k} i F_i(k)} \tag{10} $$

is the required normalising factor.



In the boundary case, when i = 1, we have

$$ E_k(F_1(k+1)) = F_1(k) + p + \beta_k \big( (1-r)\, 2F_2(k) - F_1(k) \big), \tag{11} $$

for the expected number of balls in $urn_1$, given the state at stage k.

In order to obtain a solution for the model, we assume that, for large k, the random
variable $\beta_k$ can be approximated by a constant value $\hat{\beta}_k$ depending only on k. This is defined
by

$$ \hat{\beta}_k = \frac{1-p}{k \big( 1 - 2(1-p)(1-r) \big)}. $$

The motivation for this approximation is that the denominator in the definition of $\beta_k$, the
total number of pins, has been replaced by its expectation given in (2). This is a reasonable
assumption since the number of pins is the difference between two binomial random variables,
and with high probability this will be close to its expected value. We observe that using $\hat{\beta}_k$
as the normalising factor instead of $\beta_k$ results in an approximation similar to that of the "$p_k$
model" in [LFLW02].

Replacing $\beta_k$ by $\hat{\beta}_k$ and taking the expectations of (9) and (11), we obtain

$$ E(F_i(k+1)) = E(F_i(k)) + \hat{\beta}_k \big( r(i-1)E(F_{i-1}(k)) + (1-r)(i+1)E(F_{i+1}(k)) - i E(F_i(k)) \big) \tag{12} $$

and

$$ E(F_1(k+1)) = E(F_1(k)) + p + \hat{\beta}_k \big( (1-r)\, 2E(F_2(k)) - E(F_1(k)) \big), \tag{13} $$

respectively.
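
Equations (12) and (13) can be iterated numerically to follow the expected urn occupancies. The sketch below does this in Python under two assumptions that are not part of the model: the urns are truncated at a hypothetical `max_urn` (the expected occupancy of the next urn is held at zero), and the large-k normalising factor is used from the first stage onwards, where it is only a rough approximation.

```python
def mean_field(p, r, steps, max_urn=200):
    """Iterate (12) and (13) starting from E(F_1(1)) = 1.

    Returns a list E in which E[i] approximates E(F_i(k)) at stage
    k = steps + 1; E[0] is unused and E[max_urn + 1] is held at zero.
    """
    c = 1 - 2 * (1 - p) * (1 - r)      # expected pins gained per step
    E = [0.0] * (max_urn + 2)
    E[1] = 1.0
    for k in range(1, steps + 1):
        beta = (1 - p) / (k * c)       # approximate normalising factor
        new = E[:]
        # boundary case i = 1, equation (13)
        new[1] = E[1] + p + beta * ((1 - r) * 2 * E[2] - E[1])
        # general case i > 1, equation (12)
        for i in range(2, max_urn + 1):
            new[i] = E[i] + beta * (r * (i - 1) * E[i - 1]
                                    + (1 - r) * (i + 1) * E[i + 1]
                                    - i * E[i])
        E = new
    return E
```

Choosing `max_urn` larger than the number of steps keeps the truncation from having any effect, since at most one new urn can become occupied per step.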

In order to solve (12) and (13), we would like to show that $E(F_i(k))/k$ tends to a limit
$f_i$ as k tends to infinity. Suppose for the moment that this is the case, then, provided the
convergence is fast enough, $E(F_i(k+1)) - E(F_i(k))$ tends to $f_i$. By "fast enough" we mean
that $\epsilon_{i,k+1} - \epsilon_{i,k}$ is $o(1/k)$ for large k, where

$$ E(F_i(k)) = k (f_i + \epsilon_{i,k}). $$

Now, letting

$$ \beta = k \hat{\beta}_k = \frac{1-p}{1 - 2(1-p)(1-r)}, \tag{14} $$

we see that $\hat{\beta}_k E(F_i(k))$ tends to $\beta f_i$ as k tends to infinity. Thus, letting k tend to infinity in
(12) and (13), these yield

$$ f_i = \beta \big( r(i-1) f_{i-1} + (1-r)(i+1) f_{i+1} - i f_i \big), \tag{15} $$

and

$$ f_1 = p + \beta \big( (1-r)\, 2 f_2 - f_1 \big), \tag{16} $$

respectively.

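We now investigate the asymptotic behaviour of $f_i$ for large i. If we define $g_i$ to be $i f_i$,
we can rewrite (15) and (16) as

$$ r \frac{g_{i-1}}{g_i} + (1-r) \frac{g_{i+1}}{g_i} = 1 + \frac{1}{\beta i} \tag{17} $$

and

$$ \frac{p}{\beta g_1} + (1-r) \frac{g_2}{g_1} = 1 + \frac{1}{\beta}. \tag{18} $$

Suppose we can write $g_i$ as

$$ g_i = C\, \frac{\Gamma(i+1+\alpha)}{\Gamma(i+1+\alpha+\rho)} \left( 1 + \frac{\eta}{i^2} + O\Big(\frac{1}{i^3}\Big) \right), \tag{19} $$

where $\alpha$, $\eta$, $\rho$ and C are constants.

Using the binomial theorem and the fact that $\Gamma(x+1) = x\Gamma(x)$, we get

$$ \frac{g_{i-1}}{g_i} = \frac{i+\alpha+\rho}{i+\alpha} \left( 1 + O\Big(\frac{1}{i^3}\Big) \right) = 1 + \frac{\rho}{i} - \frac{\rho\alpha}{i^2} + O\Big(\frac{1}{i^3}\Big). \tag{20} $$

Similarly,

$$ \frac{g_{i+1}}{g_i} = \frac{i+\alpha+1}{i+\alpha+\rho+1} \left( 1 + O\Big(\frac{1}{i^3}\Big) \right) = 1 - \frac{\rho}{i} + \frac{\rho(\alpha+\rho+1)}{i^2} + O\Big(\frac{1}{i^3}\Big). \tag{21} $$

Substituting (20) and (21) into the left-hand side of (17), we obtain

$$ r \left( 1 + \frac{\rho}{i} - \frac{\rho\alpha}{i^2} \right) + (1-r) \left( 1 - \frac{\rho}{i} + \frac{\rho(\alpha+\rho+1)}{i^2} \right) + O\Big(\frac{1}{i^3}\Big), $$

which simplifies to

$$ 1 + \frac{\rho}{i}(2r-1) + \frac{\rho}{i^2} \big( (1-r)(\rho+1) - (2r-1)\alpha \big) + O\Big(\frac{1}{i^3}\Big) = 1 + \frac{1}{\beta i}. \tag{22} $$

For this to hold for all large enough values of i, we require

$$ \rho = \frac{1}{\beta(2r-1)} \tag{23} $$

and

$$ \alpha = \frac{(1-r)(\rho+1)}{2r-1}. \tag{24} $$

It is straightforward to obtain more terms in the expansion of (19) by a more detailed
analysis, for example

$$ \eta = \frac{\rho \big( (1-r)(\alpha+\rho+1)^2 - r\alpha^2 \big)}{2(2r-1)}. $$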

So, with $\rho$ and $\alpha$ as in (23) and (24), for large i, (19) gives an approximate solution to
the recurrence defined by (15) and (16). Thus,

$$ f_i \sim C\, i^{-(1+\rho)}, \tag{25} $$

where C is independent of i and $\sim$ means is asymptotic to.
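
The asymptotic solution can be checked numerically. The sketch below assumes the leading form $g_i \propto \Gamma(i+1+\alpha)/\Gamma(i+1+\alpha+\rho)$, with $\rho$ and $\alpha$ chosen as in (23) and (24) and without the $\eta/i^2$ correction, and evaluates the difference between the two sides of (17); the function name is hypothetical. If the form is right, the terms in 1/i and $1/i^2$ cancel and the residual decays like $1/i^3$.

```python
def mean_field_residual(i, p, r):
    """Left-hand side minus right-hand side of (17) for the leading form of g_i."""
    q = (1 - p) * (1 - r)
    beta = (1 - p) / (1 - 2 * q)                   # equation (14)
    rho = 1 / (beta * (2 * r - 1))                 # equation (23)
    alpha = (1 - r) * (rho + 1) / (2 * r - 1)      # equation (24)
    # Gamma(x + 1) = x * Gamma(x) reduces ratios of successive g_i
    # to simple rational functions of i
    down = (i + alpha + rho) / (i + alpha)             # g_{i-1} / g_i
    up = (i + alpha + 1) / (i + alpha + rho + 1)       # g_{i+1} / g_i
    return r * down + (1 - r) * up - (1 + 1 / (beta * i))
```

Doubling i should therefore reduce the residual by a factor close to 8, which is a quick way to confirm the cubic decay for given p and r.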
From (14) and (23),

$$ \rho = 1 + \left( \frac{p}{1-p} \right) \left( \frac{1}{2r-1} \right). \tag{26} $$

If, as is usually the case, we assume that $\rho$ is positive, then r > 1/2, and it follows that
increasing p increases $\rho$, whereas increasing r decreases $\rho$. In particular, we observe from (26)
that with pin removal, i.e. for r < 1, the exponent of the power law is greater than it would
be for r = 1, i.e.

$$ 1 + \rho > 1 + \frac{1}{1-p}. \tag{27} $$
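
The dependence of the exponent on p and r in (26) is easy to explore numerically. The helper below is a small sketch with a hypothetical function name; it returns $\rho$, so the power-law exponent is one more than the returned value.

```python
def power_law_rho(p, r):
    """Evaluate rho from equation (26); the power-law exponent is 1 + rho."""
    if not (0 < p < 1 and 0.5 < r <= 1):
        raise ValueError("requires 0 < p < 1 and 1/2 < r <= 1")
    return 1 + p / ((1 - p) * (2 * r - 1))
```

For r = 1 the value reduces to 1/(1 - p), and for any fixed p it grows as r decreases towards 1/2.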



4 Recognising Pin Removal 

In this section we show that, using the mean-field equations derived in Section 3, we are able
to detect whether pin removal has taken place by inspecting a static snapshot of the system
at the steady state.

Based on (2) and (3) we have

$$ \frac{pins}{k} \approx 1 - 2(1-p)(1-r), \tag{28} $$

$$ \frac{balls}{k} \approx p - (1-p)(1-r)\phi, \tag{29} $$

where pins and balls stand for the expected numbers of pins and balls at stage k, and $\phi$ is
the asymptotic value of $\phi^{(k)}$ defined by (5), i.e. the expected proportion of pins in the first
urn, $|urn_1|/pins$. The right-hand sides of (28) and (29) give asymptotic values of pins/k and
balls/k as k tends to infinity. It follows that the asymptotic value of the ratio balls/pins, which
we denote by $\Delta$, is given by

$$ \Delta = \frac{p - (1-p)(1-r)\phi}{1 - 2(1-p)(1-r)}. \tag{30} $$

From this and the fact that $\Delta \ge \phi$ it follows that

$$ \phi \le \frac{p}{1 - (1-p)(1-r)} \le \Delta \le \frac{p}{1 - 2(1-p)(1-r)}. \tag{31} $$

Using (7) and (26), this implies

$$ \frac{\phi}{2} < p \le \Delta \le \frac{\rho - 1}{\rho}. \tag{32} $$



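Now suppose we are given $\rho$, $\Delta$ and $\phi$. Then from (26) and (30) we can derive the following
equations:

$$ p = \frac{(\rho - 1)\phi}{2\rho(1 - \Delta) - 2 + \rho\phi}, \tag{33} $$

$$ r = \frac{1}{2} \left( 1 + \frac{\phi}{2\rho(1 - \Delta) - 2 + \phi} \right). \tag{34} $$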
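
Solving (26) and (30) for p and r gives closed-form estimates that are simple to evaluate from a snapshot. The sketch below uses hypothetical names: `rho` is the fitted exponent minus one, `delta` is the observed ratio balls/pins, and `phi` is the observed proportion of pins in the first urn.

```python
def estimate_p_r(rho, delta, phi):
    """Estimate p and r from snapshot statistics via (33) and (34)."""
    base = 2 * rho * (1 - delta) - 2
    p = (rho - 1) * phi / (base + rho * phi)
    r = 0.5 * (1 + phi / (base + phi))
    return p, r

def deletion_fraction(p, r):
    """Proportion of steps on which a pin is removed, (1 - p)(1 - r)."""
    return (1 - p) * (1 - r)
```

With the estimates reported in Section 5 ($\rho$ = 1.1535, $\Delta$ = 0.1205, $\phi$ = 0.0553) this reproduces, to four decimal places, p = 0.0915 and r = 0.8280.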

Now suppose we observe an urn process with $1/2 < r \le 1$ and 0 < p < 1, and we take a
snapshot of the system at the steady state, obtaining empirical estimates for $\rho$, $\Delta$ and $\phi$ from
the distribution of the balls in the urns. We would like to check whether it is possible that
these values could have arisen from Simon's process, i.e. with r = 1, with the probability of
inserting a ball into the first urn equal to, say, p'. So from (26) and (30) we would require
$\rho = 1/(1 - p')$ and $\Delta = p'$. Thus, from (26), we would obtain

$$ p' = \frac{p}{1 - 2(1-p)(1-r)}, \tag{35} $$

and, from (30),

$$ p' = \Delta = \frac{p - (1-p)(1-r)\phi}{1 - 2(1-p)(1-r)}. \tag{36} $$

It is evident that (35) and (36) can only be consistent if r = 1. Thus, simulating the urn
process with $p' = (\rho - 1)/\rho$ and r = 1 would result in the same value for $\rho$ as that obtained
from p and the original value of r. However, in the presence of pin removal, the value of $\Delta$ (i.e.
the asymptotic value of balls/pins) would be less than p', the probability of inserting a new
ball into the first urn. This provides a discriminator between the two processes. As a result,
by examining the said snapshot we are able to ascertain whether the process is consistent
with the urn model in which pins may be discarded, i.e. with r < 1. We apply this analysis
in the next section to the degree distribution of inlinks in the web graph.
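
The discriminator can be expressed as a short check. In the sketch below the function names and the optional tolerance `tol` (a hypothetical allowance for sampling noise in the estimates) are assumptions made for illustration; `rho` and `delta` are the snapshot estimates of the exponent minus one and of balls/pins.

```python
def simon_equivalent_p(rho):
    """Insertion probability p' for which Simon's process (r = 1) yields rho."""
    return (rho - 1) / rho

def shows_pin_removal(rho, delta, tol=0.0):
    """True if balls/pins falls below p', which indicates r < 1."""
    return delta < simon_equivalent_p(rho) - tol
```

For the values reported in Section 5, $\rho$ = 1.1535 gives p' of about 0.133, which exceeds the observed balls/pins ratio of 0.1205.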

5 A Model for the Evolution of the Web Graph



We now describe a discrete stochastic process for simulating the evolution of the degree 
distribution of inlinks in the web graph. In this model, balls correspond to web pages and 
pins correspond to inlinks. At each time step the state of the web graph is a directed graph 
G = (N, E), where N is its node set and E is its arc set. In this scenario $F_i(k)$, $i \ge 1$, is
the number of pages (nodes) in the web graph having i inlinks. We note that, although we
have chosen i to denote the number of inlinks, i could alternatively denote the number of
outlinks, the number of pages in a web site, or any other reasonable parameter we would like
to investigate; see [LFLW02] for further details.

Consider the evolution of the web graph with respect to the number of pages having i 
inlinks at the kth step of the process. Initially G contains just a single page. At each step one 
of three things can happen. First, with probability p a new page having one incoming link is 
added to G; this is equivalent to placing a new ball in $urn_1$. Alternatively, with probability
1 - p a page is chosen, the probability of choosing a given page being proportional to i, the
number of inlinks the page currently has; this is equivalent to preferentially choosing a ball
from $urn_i$. Then, with probability r the chosen page receives a new inlink; this is equivalent
to transferring the chosen ball from $urn_i$ to $urn_{i+1}$. Alternatively, with probability 1 - r an
inlink to the page is removed, and if the page has no inlinks remaining it is removed from
the graph; this is equivalent to transferring the ball to $urn_{i-1}$ when i > 1, and discarding the
ball from the system if i = 1.

As a proof of concept, we use the inlinks data from a large crawl of the web performed
during May 1999 [BKM+00]; the data set we used in this analysis was obtained from Ravi
Kumar at IBM. After removing web pages with zero inlinks from the data set, there remain
approximately 177 million pages with a total of 1.466 billion inlinks. Using the ranked
approach, linear regression on the log-log transformed data gives an exponent of 2.1052, which is
consistent with the value of 2.1 reported in [BKM+00] and confirmed subsequently in [AB02];
see also [DKM+02].

As discussed in the introduction, we have some reservations about the use of the ranked 
approach, as it uses ranks rather than values, thereby ignoring the gaps. In Broder et al.'s 
May 1999 data set, the first gap appears at 3121, i.e. there were no pages found with 3121 
inlinks. Two possible reasons for such a gap are: (i) there were no pages in the web having 
3121 inlinks, or (ii) there were one or more such pages but the crawl did not cover these pages. 
In any case, there is no reason to believe that the web will not contain such a page in the 
future. Moreover, such gaps, which are inherent in preferential attachment models, change 
over time due to the growth of the web graph and the stochastic nature of the evolutionary 
process. 

One way to avoid this issue is to use the unranked approach and carry out the regression
only on the data values preceding the first gap. Using the first 3120 data values of the inlinks
data set, the regression yields an exponent of 2.1535. In Figure 1 we show the two regression
lines, which have negative slopes of -2.1052, when using the ranked approach, and -2.1535,
when using the unranked approach. (Recall that the exponent of the power law is $1 + \rho$.)

[Figure 1: Broder et al. May 1999 inlink data. Log-log plot titled "May 1999 web data"; horizontal axis: rank of number of inlinks.]

From the inlinks data we then estimated $\Delta$ and $\phi$, obtaining $\Delta \approx balls/pins = 0.1205$ and
$\phi \approx |urn_1|/pins = 0.0553$. Using these estimates for $\Delta$ and $\phi$, and $\rho = 1.1535$, we computed
p and r from (33) and (34), respectively, obtaining p = 0.0915 and r = 0.8280. From the
analysis in Section 4 it is evident that link deletion is taking place in the web graph, since
r < 1. The extent of link deletion indicated as a proportion of insertions and deletions is
(1 - p)(1 - r) = 0.156. It would be interesting to check this value empirically by looking at
sequential snapshots of the web.

To validate our mean-field analysis we conducted 10 simulation runs each for $k = 10^7$
steps with the above values of p and r. Over the 10 runs, the mean for $\Delta$ was 0.1197 with
standard deviation $1.1 \times 10^{-?}$, and computing $\phi$ as $|urn_1|/pins$ gave a mean of 0.0617 with
standard deviation $9.2 \times 10^{-?}$. However, if we compute $\phi$ from (29) the mean is 0.0589 with
standard deviation $4.9 \times 10^{-?}$, which is closer to the empirical value of 0.0553. We suggest
that the difference between the two estimates for $\phi$ is mainly due to the slow convergence of
$|urn_1|/pins$.

Lastly, we computed $\rho$ from the simulation results as

$$ \rho = \frac{pins}{pins - kp}, $$

which follows from (26) and (28). This gave a mean value for $\rho$ of 1.1534, with standard
deviation $5.6 \times 10^{-?}$, compared to the value of 1.1535 computed from the mean-field equation.
For regression purposes 10 million simulation steps are insufficient to get close to the 
asymptotic value of $\rho$, so we ran two additional simulations of 1 billion steps to compare the
exponents obtained using the ranked and unranked approaches. For efficiency reasons, we 
modified the simulation by only considering the first 7000 urns, approximating the effect at 
the upper boundary. The simulation was implemented in Matlab, running on a desktop PC 
with a 1485 MHz Intel Pentium 4 processor and 500 MB of RAM, on a Windows 2000 platform.
These 1 billion step simulations, with 7000 urns, each took over 300 hours. (We conducted 
several more runs of 1 billion steps, varying the numbers of urns, to validate the robustness 
of the approximation; a single run of 2 billion steps with 15000 urns gave similar results to 
those we report below.) 

For the first (second) run with 7000 urns, we found that the first empty urn was urn 
number 2403 (2395) and that overall there were 5859 (5914) non-empty urns. For the unranked 
approach, linear regression on the log-log transformed data for the first 2402 (2394) urns gave 
an exponent of 2.1603 (2.1558). On the other hand, for the ranked approach, regression on all 
the non-empty urns gave an exponent of 2.1199 (2.1095). In summary, the simulations of our 
stochastic model are consistent with the inlinks data set, highlighting a small but noticeable 
difference in the exponent depending on whether the ranked or unranked approach is used. 
In fact, with our evolutionary model of the web graph, in common with all others, there is 
no easy way of handling the gaps and, moreover, the concept of gaps has no meaning in the 
context of an asymptotic mean field analysis. 



6 Concluding Remarks 

We have presented an extension of Simon's classical stochastic process that allows for pins 
(which might represent web links) to be discarded, and have shown that asymptotically it 
still follows a power-law distribution. Given a snapshot of a system, the mean-field equations 
that we have derived give estimates of the parameters p and r, which can then be input to 
our stochastic process simulating the evolution of the system. We applied our analysis to the 
May 1999 web crawl data [BKM+00] to detect the extent to which link deletion had taken
place. The values of p and r that we have obtained indicate that approximately 15% of all
link operations are deletions. We also ran a number of simulations to validate the mean-field 
analysis. 

An interesting finding that came to light when analysing the data was that there is good 
evidence that the exponent of the power law for inlinks is in fact close to 2.15 rather than 
to the widely published value of 2.1. Although the difference between these exponents is 
small, we consider it to be significant because it suggests a more justifiable way to use linear 
regression to obtain exponents from power law data. 

It would be interesting to study link deletion through historical data, such as that provided 
by the wayback machine [Not02] (http://www.waybackmachine.org), in order to gain more
insight into the dynamic aspects of the evolution of the web graph. In particular, it would 
be desirable to determine whether and how the exponent, and the parameters p and r, are 
changing over time. 